Abstract

Spatial  database  research  has  continued  to  advance  greatly  since  three  decades  ago,  addressing  the 
growing  data  management  and  analysis  needs  of  spatial  applications.  This  research  has  produced  a 
taxonomy  of  models  for  space,  conceptual  models,  spatial  query  languages  and  query  processing,  spatial 
file  organization  and  indexes,  and  spatial  data  mining.  However,  emerging  needs  for  spatial  database 
systems  include  the  handling  of  3D  spatial  data,  temporal  dimension  with  spatial  data,  and  spatial  data 
visualization.  In  addition,  the  rise  of  new  systems  such  as  sensor  networks  and  multi-core  processors  is 
likely  to  have  an  impact  in  spatial  databases.  The  goal  of  this  paper  is  to  provide  a  broad  overview  of 
the  recent  advancements  in  spatial  databases  and  research  needs  in  each  area. 


1  Introduction 

Spatial  database  management  systems  [63,  80,  99,  128,  154,  155]  aim  at  the  effective  and  efficient  management 
of  data  related  to 

•  space  in  the  physical  world  (geography,  urban  planning,  astronomy,  human  anatomy,  fluid  flow  or  an 
electromagnetic  field) ; 

•  biometrics  (fingerprints,  palm  measurements,  facial  patterns); 

•  engineering  design  (very  large  scale  integrated  circuits,  layout  of  a  building,  or  the  molecular  structure 
of  a  pharmaceutical  drug);  and 
• conceptual information space (virtual reality environments, multidimensional decision support
systems).

A  Spatial  Database  Management  System  (SDBMS)  can  be  characterized  as  follows: 

• An SDBMS is a software module that can work with an underlying database management system, for
example,  an  Object-Relational  database  management  system,  or  Object-oriented  database  management 
system. 

•  SDBMSs  support  multiple  spatial  data  models,  commensurate  spatial  abstract  data  types  (ADTs), 
and  a  query  language  from  which  these  ADTs  are  callable. 

•  SDBMSs  support  spatial  indexing,  efficient  algorithms  for  spatial  operations,  and  domain-specific  rules 
for  query  optimization. 

Spatial  database  research  has  been  an  active  area  for  several  decades.  The  results  of  this  research  are 
being  used  in  a  number  of  areas.  To  cite  a  few  examples,  the  filter-and-refine  technique  used  in  spatial  query 
processing  has  been  applied  to  subsequence  mining;  multidimensional-index  structures  such  as  R-tree  and 
Quad-tree  used  in  accessing  spatial  data  are  applied  in  the  field  of  computer  graphics  and  image  processing; 
and  space-filling  curves  used  in  spatial  query  processing  and  data  storage  are  applied  in  dimension  reduction 
problems.  The  field  of  spatial  databases  can  be  defined  by  its  accomplishments;  current  research  is  aimed  at 
improving  its  functionality,  extensibility,  and  performance.  The  impetus  for  improving  functionality  comes 
from the needs of existing applications such as Geographic Information Systems (GIS), Location Based
Services (LBS) [122], sensor networks [140], ecology and environmental management [120], public safety,
transportation [88], Earth science, epidemiology [48], crime analysis [91], and climatology.

Commercial examples of spatial database management include ESRI’s ArcGIS Geodatabase [24], Oracle
Spatial [32], IBM’s DB2 Spatial Extender and Spatial Datablade, and future systems such as Microsoft’s
SQL Server 2008 (code-named Katmai) [79]. Spatial databases have played a major role in the commercial
industry, with products such as Google Earth [59] and Microsoft’s Virtual Earth [100]. Research prototype
examples of spatial database management systems include spatial datablades with PostGIS [111], MySQL’s
Spatial Extensions [103], Sky Server [17] and spatial extensions. The functionalities provided by these
systems include a set of spatial data types such as points, line-segments and polygons, and a set of spatial
operations such as inside, intersection, and distance. The spatial types and operations may be made a part
of a query language such as SQL, which allows spatial querying when combined with an object-relational
database management system [42, 141]. The performance enhancement provided by these systems includes a
multi-dimensional spatial index and algorithms for spatial database modeling such as OGIS [107] and 3D
topological modeling; spatial query processing including point, regional, range, and nearest neighbor
queries; and spatial data methods using a variety of indexes such as quad trees and grid cells.

1.1  Related  Work  and  Our  Contributions 

Published  work  related  to  spatial  databases  can  broadly  be  classified  as  follows: 

•  Textbooks  [128,  155,  112,  99],  which  explain  in  detail  various  topics  in  spatial  databases  such  as  logical 
data  models  for  spatial  data,  algorithms  for  spatial  operations,  and  spatial  data  access  methods.  Recent 
textbooks  [155,  62]  deal  with  research  trends  in  spatial  databases  such  as  spatio-temporal  databases, 
and  moving  objects  databases. 

•  Reference  books  [133,  119],  which  are  useful  for  studying  areas  related  to  spatial  databases,  for  example, 
multidimensional  data  structures,  and  Geographic  Information  Systems  (GIS). 

•  Journals  and  conference  proceedings  [1,  2,  3,  4,  5,  6,  8,  7,  9,  11],  which  are  a  source  of  in-depth  technical 
knowledge  of  specific  problem  areas  in  spatial  databases. 

•  Research  surveys  [138,  63,  19],  which  summarize  key  accomplishments  and  identify  research  needs  in 
various  areas  of  spatial  databases  at  that  time. 

Spatial  database  research  has  continued  to  advance  greatly  since  the  last  survey  papers  in  this  area  were 
published  [138,  63,  19].  Our  contribution  in  this  chapter  is  to  summarize  the  most  recent  accomplishments  in 
spatial  database  research,  a  number  of  which  were  identified  as  research  needs  in  earlier  surveys.  For  instance, 
bulk loading techniques and spatial join strategies are referenced here as well as other advances in spatial
data  mining  and  conceptual  modeling  of  spatial  data.  In  addition,  this  chapter  provides  an  extensive  updated 
list  of  research  needs  in  such  areas  as  management  of  3D  spatial  data,  visibility  queries,  and  many  others. 
The  bibliography  section  at  the  end  of  this  chapter  contains  a  list  of  over  100  references,  updated  with  the 
latest  achievements  in  spatial  databases. 

1.2  Scope  and  Outline 

The  goal  of  this  chapter  is  to  provide  the  reader  with  a  broad  introduction  to  spatial  database  systems. 
Spatial  databases  are  discussed  in  the  context  of  object-relational  databases  [42,  141,  143],  which  provide 
extensibility  to  many  components  of  traditional  databases  to  support  the  spatial  domain.  Three  major 
areas  that  receive  attention  in  the  database  context  -  conceptual,  logical  and  physical  data  models  -  are 
discussed (see Table 1). In addition, applications of spatial data for spatial data mining are also explored.
Emerging needs for spatial database systems include the handling of 3D spatial data, spatial data with
temporal  dimension,  and  effective  visualization  of  spatial  data.  The  emergence  of  hardware  technology  such 
as  Storage  Area  Networks  and  the  availability  of  multi-core  processors  are  two  additional  fields  likely  to  have 
an  impact  on  spatial  databases.  Such  topics  of  research  interest  are  introduced  at  the  end  of  each  section. 
References  are  provided  for  further  exploration. 

The  rest  of  this  chapter  is  organized  as  follows:  Fundamental  concepts  helpful  to  understand  spatial 
databases  are  presented  in  section  2.  Sections  3  and  4  describe  spatial  database  modeling  at  the  conceptual 
and  logical  levels.  Techniques  for  spatial  query  processing  are  discussed  in  section  5.  File  organizations  and 
index  data  structures  are  presented  in  section  6.  Spatial  data  mining  patterns  and  techniques  are  explored 
in  section  7. 


2  Mathematical  Framework 


2.1  Accomplishments 

Spatial data are more complex than traditional business data. Specific features of spatial
data  include:  i)  rich  data  types  (e.g.,  extended  spatial  objects),  ii)  implicit  spatial  relationships  among  the 
variables,  iii)  observations  that  are  not  independent,  and  iv)  spatial  autocorrelation  among  the  features. 

Spatial data can be considered to have two types of attributes: non-spatial attributes and spatial
attributes. Non-spatial attributes are used to characterize non-spatial features of objects, such as name,
population,  and  unemployment  rate  for  a  city.  Spatial  attributes  are  used  to  define  the  spatial  location 
and  extent  of  spatial  objects  [35].  The  spatial  attributes  of  a  spatial  object  most  often  include  information 
related to spatial locations, e.g., longitude, latitude, elevation, as well as shape. Relationships among
non-spatial objects are explicit in data inputs, e.g., arithmetic relation, ordering, instance of, subclass of, and
membership  of.  In  contrast,  relationships  among  spatial  objects  are  often  implicit,  such  as  overlap,  intersect, 
and  behind. 

Space is a framework to formalize specific relationships among a set of objects. Depending on the
relationships of interest, different models of space such as set-based space, topological space, Euclidean
space, metric space and network space can be used [155]. Set-based space uses the basic notion of elements,
element-equality, sets, and membership to formalize the set relationships such as set-equality, subset, union,
cardinality, relation, function, and convexity. Relational and object-relational databases use this model of
space.

Mathematical Framework
Conceptual Data Model
Logical Data Model
Query Languages
Query Processing
File Organizations and Indices
Trends: Spatial Data Mining

Table 1: Spatial Database topics

Topological  space  uses  the  basic  notion  of  a  neighborhood  and  points  to  formalize  the  extended  object 
relations  such  as  boundary,  interior,  open,  closed,  within,  connected,  and  overlaps,  which  are  invariant 
under  elastic  deformation.  Combinatorial  topological  space  formalizes  relationships  such  as  Euler’s  formula 
(number  of  faces  +  number  of  vertices  -  number  of  edges  =  2  for  planar  configuration).  Network  space  is  a 
form  of  topological  space  in  which  the  connectivity  property  among  nodes  formalizes  graph  properties  such 
as  connectivity,  isomorphism,  shortest-path,  and  planarity. 
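
Euler's formula can be checked mechanically on a small planar configuration. The sketch below is illustrative only: the function name and the face list (a cube flattened into the plane) are assumptions made for this example and do not come from any cited system.

```python
def euler_characteristic(faces):
    """Return V - E + F for a planar configuration given as a list of faces.

    Each face is a cycle of vertex labels; the unbounded outer face
    must be included in the list.
    """
    vertices = set()
    edges = set()
    for face in faces:
        for i, v in enumerate(face):
            w = face[(i + 1) % len(face)]
            vertices.add(v)
            edges.add(frozenset((v, w)))  # undirected edge shared by two faces
    return len(vertices) - len(edges) + len(faces)


# Cube flattened into the plane: 8 vertices, 12 edges, 6 faces
# (five bounded faces plus the outer face).
CUBE_FACES = [
    (0, 1, 2, 3),  # outer face
    (4, 5, 6, 7),  # inner square
    (0, 1, 5, 4),
    (1, 2, 6, 5),
    (2, 3, 7, 6),
    (3, 0, 4, 7),
]

print(euler_characteristic(CUBE_FACES))
```

For the cube the three counts are 8, 12 and 6, so the function returns 2, in agreement with the formula.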

Euclidean  coordinatized  space  uses  the  notion  of  a  coordinate  system  to  transform  spatial  properties 
and  relationships  to  properties  of  tuples  of  real  numbers.  Metric  spaces  formalize  the  distance  relationships 
using  positive  symmetric  functions  that  obey  the  triangle  inequality.  Many  multidimensional  applications 
use  Euclidean  coordinatized  space  with  metrics  such  as  distance. 
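
The metric-space properties can be tested numerically on a sample of points. The following is a minimal sketch under stated assumptions: the helper names are hypothetical, and a finite sample can only refute the axioms, never prove them.

```python
import itertools
import math


def euclidean(p, q):
    """Euclidean distance between two coordinate tuples of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def satisfies_metric_axioms(dist, points, tol=1e-9):
    """Check non-negativity, symmetry and the triangle inequality on a sample."""
    for p, q in itertools.product(points, repeat=2):
        if dist(p, q) < 0 or abs(dist(p, q) - dist(q, p)) > tol:
            return False
    for p, q, r in itertools.product(points, repeat=3):
        if dist(p, r) > dist(p, q) + dist(q, r) + tol:
            return False
    return True


SAMPLE = [(0.0, 0.0), (3.0, 4.0), (-1.0, 2.5), (6.0, -2.0)]
print(euclidean(SAMPLE[0], SAMPLE[1]))
print(satisfies_metric_axioms(euclidean, SAMPLE))
```

Squared Euclidean distance is a useful counterexample: it is symmetric and non-negative, but it violates the triangle inequality for three collinear points, so the same check rejects it.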

2.2  Research  Needs 

Many spatial applications manipulate continuous spaces of different scales and with different levels of
discretization. A sequence of operations on discretized data can lead to growing errors similar to the ones
introduced by finite-precision arithmetic on numbers. There are preliminary results [63] on the use of
discrete basis and bounding errors with peg-board semantics. Another related problem concerns interpolation
to  estimate  the  continuous  field  from  a  discretization.  Negative  spatial  autocorrelation  makes  interpolation 
error-prone.  Further  work  is  needed  on  a  framework  to  formalize  the  discretization  process,  its  associated 
errors,  and  on  interpolation. 
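
The growth of discretization error over a sequence of operations can be seen in a toy one-dimensional case. The sketch below is an illustration only and is not the peg-board technique cited above; the function names, the integer grid, and the step size are all assumptions.

```python
def translate_with_snapping(start, step, count):
    """Apply `count` translations, snapping to the integer grid after each one."""
    position = start
    for _ in range(count):
        position = round(position + step)  # discretize after every operation
    return position


def translate_exact(start, step, count):
    """Apply the same translations on the continuous coordinate."""
    return start + step * count


snapped = translate_with_snapping(0, 0.4, 10)
exact = translate_exact(0, 0.4, 10)
print(snapped, exact, abs(exact - snapped))
```

Each individual snap loses at most half a grid cell, yet the object never leaves the origin while its continuous counterpart moves four units, so the error grows with the number of operations.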

3  Spatial  Database  Conceptual  Modeling 

3.1  Accomplishments 

Entity  Relationship  (ER)  diagrams  are  commonly  used  in  designing  the  conceptual  model  of  a  database. 
Many extensions [65] have been proposed to extend ER to make the conceptual modeling of spatial
applications easier and more intuitive. One such extension is the use of pictograms [132]. A pictogram is
a graphical icon that can represent a spatial entity or a spatial relationship between spatial entities. The
idea is to provide constructs to capture the semantics of spatial applications and at the same time to keep
the graphical representation simple. Figure 1 provides different types of pictograms for spatial entities and
relationships. In the following text we define pictograms to represent spatial entities and relationships, and
their grammar in graphical form.

Pictogram: A pictogram is a representation of the object inserted inside of a box. These iconic
representations are used to extend ER diagrams and are inserted at appropriate places inside the entity boxes. An
entity  pictogram  can  be  of  a  basic  shape  or  a  user-defined  shape. 

Shape:  Shape  is  the  basic  graphical  element  of  a  pictogram  that  represents  the  geometric  types  in  the 
spatial  data  model.  It  can  be  a  basic  shape,  a  multishape,  a  derived  shape,  or  an  alternate  shape.  Most 
objects  have  simple  (basic)  shapes  (Figure  1  B). 

Basic  Shape:  In  a  vector  model  the  basic  elements  are  point,  line  and  polygon.  In  a  forestry  example, 
the  user  may  want  to  represent  a  facility  as  a  point  (0-D),  a  river  or  road  network  as  lines  (1-D)  and  forest 
areas  as  polygons  (2-D)  (Figure  1  D). 

Multi-Shape:  To  deal  with  objects  which  cannot  be  represented  by  the  basic  shapes,  we  can  use  a  set 
of  aggregate  shapes.  Cardinality  is  used  to  quantify  multi-shapes.  For  example,  a  river  network  that  is 
represented  as  a  line  pictogram  scale  will  have  cardinality  0  (Figure  1  B  and  E). 

Derived  Shape:  If  the  shape  of  an  object  is  derived  from  the  shapes  of  other  objects,  its  pictogram 
is  italicized.  For  example,  we  can  derive  a  forest  boundary  (polygon)  from  its  “forest  type”  boundaries 
(polygon),  or  a  country  boundary  from  the  constituent  state  boundaries  (Figure  1  C  and  G). 

Alternate  Shape:  Alternate  shapes  can  be  used  for  the  same  object  depending  on  certain  conditions; 
for  example,  objects  of  size  less  than  x  units  are  represented  as  points  while  those  greater  than  x  units 
are  represented  as  polygons.  Alternate  shapes  are  represented  as  a  concatenation  of  possible  pictograms. 
Similarly,  multiple  shapes  are  needed  to  represent  objects  at  different  scales;  for  example,  at  higher  scales 
lakes  may  be  represented  as  points,  and  at  lower  scales  as  polygons  (Figure  1  D  and  H). 

Any  Possible  Shape:  A  combination  of  shapes  is  represented  by  a  wild  card  *  symbol  inside  a  box, 
implying  that  any  geometry  is  possible  (Figure  1  E). 

User-Defined  Shape:  Apart  from  the  basic  shapes  of  point,  line  and  polygon,  user-defined  shapes  are 
possible.  User-defined  shapes  are  represented  by  an  exclamation  symbol  (!)  inside  a  box  (Figure  1  A). 

Relationship  Pictograms:  Relationship  pictograms  are  used  to  model  the  relationship  between  entities. 
For  example,  part-of  is  used  to  model  the  relationship  between  a  route  and  a  network,  or  it  can  be  used  to 
model  the  partition  of  a  forest  into  forest  stands  (Figure  1  C). 

The popularity of object-oriented languages such as C++ and Java has encouraged the growth of
object-oriented database systems (OODBMS). The motivation behind this growth in OODBMS is that the direct
mapping of the conceptual database schema into an object-oriented language leads to a reduction of the
impedance mismatch encountered when a model on one level is converted into a model on another level.

Figure 1: Pictograms

UML  is  one  of  the  standards  for  conceptual  level  modeling  for  object-oriented  software  design.  It  may 
also  be  applied  to  an  OODBMS  to  capture  the  design  of  the  system  conceptually.  A  UML  design  consists 
of  the  following  building  blocks. 

Class:  A  class  is  the  encapsulation  of  all  objects  which  share  common  properties  in  the  context  of  the 
application.  It  is  the  equivalent  of  the  entity  in  the  ER  model.  The  class  diagrams  in  a  UML  design  can  be 
further  extended  by  adding  pictograms.  In  a  forestry  example,  classes  can  be  Forest,  Facility,  Forest  Stand, 
etc. 

Attributes: Attributes characterize the objects of the class. The difference between attributes in an ER and a
UML model design is that there is no notion of a key attribute in UML. This is because in an object-oriented
system, each object has an implicit system-generated unique identification. In UML, attributes also have
a scope that restricts the attribute’s access by other classes. There are three levels of scope, and each has
a special symbol:

+ Public: This allows the attribute to be accessed and manipulated from any class.
- Private: Only the class that owns the attribute is allowed to access the attribute.
# Protected: Other than the class that owns the attribute, classes derived from that class can access the attribute.

Methods: Methods are functions and a part of the class definition. They are responsible for modifying the
behavior or state of the class. The state of the class is embodied in the current values of the attributes. In
object-oriented design, attributes should only be accessed through methods.

Relationships:  Relationships  relate  one  class  to  another  or  to  itself.  This  is  similar  to  the  concept  of 
relationship  in  the  ER  model.  There  are  three  important  categories  of  relationships: 


•  Aggregation:  This  is  a  specific  construct  to  capture  the  part-whole  relationship.  For  instance,  a  group 
of  Forest-Stand  classes  may  be  aggregated  into  a  Forest  class. 

•  Generalization:  This  is  a  relationship  in  which  a  child  class  can  be  generalized  to  a  parent  class.  For 
example,  classes  such  as  Point,  Line  and  Polygon  can  be  generalized  to  a  Geometry  class. 

•  Association:  This  shows  how  objects  of  different  classes  are  related.  An  association  is  binary  if  it 
connects  two  classes  or  ternary  if  it  connects  three  classes.  An  example  of  a  binary  association  is 
supplies_water_to between the classes River and Facility.
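
The three relationship categories can be sketched in an object-oriented language. The Python classes below are a hypothetical rendering of the forestry example, not the schema of Figure 3; all attribute names and the sample instance names are assumptions introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Coordinate = Tuple[float, float]


class Geometry:
    """Parent class: Point, Line and Polygon generalize to Geometry."""

    def dimension(self) -> int:
        raise NotImplementedError


@dataclass
class Point(Geometry):
    location: Coordinate

    def dimension(self) -> int:
        return 0


@dataclass
class Line(Geometry):
    vertices: List[Coordinate]

    def dimension(self) -> int:
        return 1


@dataclass
class Polygon(Geometry):
    ring: List[Coordinate]

    def dimension(self) -> int:
        return 2


@dataclass
class ForestStand:
    stand_id: int
    extent: Polygon


@dataclass
class Forest:
    """Aggregation: a Forest is made up of ForestStand parts."""
    name: str
    stands: List[ForestStand] = field(default_factory=list)


@dataclass
class Facility:
    name: str
    site: Point


@dataclass
class River:
    """Binary association supplies_water_to between River and Facility."""
    name: str
    course: Line
    supplies_water_to: List[Facility] = field(default_factory=list)


stand = ForestStand(1, Polygon([(0, 0), (4, 0), (4, 3), (0, 3)]))
forest = Forest("example forest", [stand])
camp = Facility("campground", Point((5, 1)))
river = River("example river", Line([(0, 5), (5, 1)]), [camp])
print([g.dimension() for g in (camp.site, river.course, stand.extent)])
```

Here generalization is expressed through inheritance, aggregation through a list of parts owned by the whole, and the binary association through a reference from one class to the other.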

Figures  2  and  3  provide  an  example  for  modeling  a  State-Park  using  ER  and  UML  with  pictograms, 
respectively. 

3.2  Research  Needs 

Conceptual  modeling  for  spatio-temporal  and  moving-object  data  needs  to  be  researched.  Pictograms  as 
introduced  in  this  section  may  be  extended  to  handle  such  data.  Models  used  in  the  spatial  representation  of 
data can be extended to consider the time dimension. For instance, the 9-intersection matrix used to represent
topology  can  be  differentiated  to  consider  the  change  in  topology  over  a  period  of  time.  Similarly,  other 
spatial  properties  such  as  position,  orientation  and  shape  can  be  differentiated  to  consider  effects  over  time 
such  as  motion,  rotation,  and  deformation  of  a  spatial  object.  Similarly,  series  of  points  can  be  accumulated 
to  represent  time-varying  spatial  data  and  properties. 

Another  area  of  research  is  the  use  of  ontology  for  knowledge  management.  An  ontology  defines  a  common 
vocabulary  that  allows  knowledge  to  be  shared  and  reused  across  different  applications.  Ontologies  provide  a 
shared  and  common  understanding  of  some  domain  that  can  be  communicated  across  people  and  computers. 
Geospatial  ontology  [54]  is  specific  to  the  geospatial  domain.  Research  in  geospatial  ontology  is  needed 
to  provide  interoperability  between  geospatial  data  and  software.  Developing  geospatial  ontologies  is  one 
of the long-term research challenges for the University Consortium for Geographic Information Systems
(UCGIS) [11]. Research on geospatial ontology is also being carried out by companies such as CYC.

Figure 2: Example ER diagram with pictograms

Geospatial  ontology  can  be  further  extended  to  include  the  temporal  dimension.  The  ontology  of  time  has 
been  researched  in  the  domain  of  artificial  intelligence  as  situation  calculus.  OWL-Time  [14]  is  an  ontology 
developed  to  represent  time. 

Semantic  web  [34]  is  widely  known  as  an  efficient  way  to  represent  data  on  the  world  wide  web.  The  wealth 
of  geographic  information  currently  available  on  the  web  has  prompted  research  in  the  area  of  GeoSpatial 
Semantic  Web  [46,  53].  In  this  context,  it  becomes  necessary  to  create  representations  of  the  geographic 
information  resources.  This  must  lead  to  a  framework  for  information  retrieval  based  on  the  semantics 
of  spatial  ontologies.  Developing  the  geo-spatial  ontology  that  is  required  in  a  geo-spatial  semantic  web 
is  challenging  because  the  defining  properties  of  geographic  entities  are  very  closely  related  to  space.  In 
addition, each entity may have several sub-entities, resulting in a complex object [53]. One popular data
model used in representing the Semantic Web is the Resource Description Framework (RDF) [151]. RDF is
being extended (GeoRDF) [148] to include spatial dimensions and hence to provide the necessary support
for geographic data on the web.

Figure 3: Example UML class diagram with pictograms


4  Spatial  Data  Models  and  Query  Languages 

4.1  Accomplishments 

Data  Models  A  spatial  data  model  provides  the  data  abstraction  necessary  to  hide  the  details  of  data 
storage.  The  two  commonly  used  models  are  the  field-based  model  and  the  object-based  model.  While  the 
field-based model adopts a functional viewpoint, object-based models treat the information space as a
collection of discrete, identifiable, spatially-referenced entities. Based on the type of data model used, the spatial
operations  may  change.  Table  2  lists  the  operations  specific  to  the  field-based  and  object-based  models.  In 
the  context  of  object-relational  databases,  a  spatial  data  model  is  implemented  using  a  set  of  spatial  data 
types  and  operations.  Over  the  last  two  decades,  an  enormous  amount  of  work  has  been  done  in  the  design 
and  development  of  spatial  abstract  data  types  and  their  embedding  in  a  query  language.  Serious  efforts  are 
being  made  to  arrive  at  a  consensus  on  standards  through  the  OGIS  consortium  [75]. 

OGIS proposed the general feature model [75] where features are considered to occur at two levels, namely,
feature instances and feature types. A geographic feature is represented as a discrete phenomenon
characterized by its geographic and temporal coordinates at the instance level, and the instances with common
characteristics are grouped into classes called feature types. Direction is another important feature used in
spatial applications. A direction feature can be modeled as a spatial object [125]. Research has also been
done to efficiently compute the cardinal direction relations between regions that are composed of sets of
spatial objects [135].

Data Model      Operator Group   Operation
Vector Object   Set-Oriented     equals, is a member of, is empty, is a subset of, is disjoint
                                 from, intersection, union, difference, cardinality
                Topological      boundary, interior, closure, meets, overlaps, is inside, covers,
                                 connected, components, extremes, is within
                Metric           distance, bearing/angle, length, area, perimeter
                Direction        east, north, left, above, between
                Network          successors, ancestors, connected, shortest-path
                Dynamic          translate, rotate, scale, shear, split, merge
Raster Field    Local            point-wise sums, differences, maximums, means, etc.
                Focal            slope, aspect, weighted average of neighborhood
                Zonal            sum or mean or maximum of field values in each zone

Table 2: Data Model and Operations

Figure 4: Modeling Geographic Information (Source: [75])

Query  Languages 

When it comes to database systems, spatial database researchers prefer object-based models because the
data types provided by object-based database systems can be extended to spatial data types by creating
abstract  data  types  (ADT).  OGIS  provides  a  framework  for  object-based  models.  Figure  4  shows  the 
OpenGIS  approach  to  modeling  geographic  features.  This  framework  provides  conceptual  schemas  to  define 
abstract  feature  types  and  provides  facilities  to  develop  application  schemas  that  can  capture  data  about 
feature  instances.  Geographic  phenomena  fall  into  two  broad  categories,  discrete  and  continuous.  Discrete 
phenomena  are  objects  that  have  well-defined  boundaries  or  spatial  extent,  examples  being  buildings  and 
streams. Continuous phenomena vary over space and have no specific extent (e.g., temperature, elevation).
A  continuous  phenomenon  is  described  in  terms  of  its  value  at  a  specific  position  in  space  (and  possibly 
time).  OGIS  represents  discrete  phenomena  (also  called  vector  data)  by  a  set  of  one  or  more  geometric 
primitives  (points,  curves,  surfaces,  or  solids).  A  continuous  phenomenon  is  represented  through  a  set  of 
values, each associated with one of the elements in an array of points. OGIS uses the term “coverage” to
refer to any data representation that assigns values directly to spatial position. A coverage is a function
from a spatio-temporal domain to an attribute domain. OGIS provides standardized representations for
spatial characteristics through geometry and topology. Geometry provides the means for the quantitative
description of the spatial characteristics including dimension, position, size, shape, and orientation. Topology
deals with the characteristics of geometric figures that remain invariant if the space is deformed elastically
and continuously. Figure 5 shows the hierarchy of geometry data types. Objects under Primitive will be
open (i.e., they will not contain their boundary points) and the objects under Complex will be closed.

Basic Functions
SpatialReference()   Returns the underlying coordinate system of the geometry
Envelope()           Returns the minimum orthogonal bounding rectangle of the geometry
Export()             Returns the geometry in a different representation
IsEmpty()            Returns true if the geometry is an empty set
IsSimple()           Returns true if the geometry is simple (no self-intersection)
Boundary()           Returns the boundary of the geometry

Topological/Set Operators
Equal                Returns true if the interior and boundary of the two geometries are
                     spatially equal
Disjoint             Returns true if the boundaries and interior do not intersect
Intersect            Returns true if the interiors of the geometries intersect
Touch                Returns true if the boundaries intersect but the interiors do not
Cross                Returns true if the interior of the geometries intersect but the
                     boundaries do not
Within               Returns true if the interior of the given geometry does not intersect
                     with the exterior of another geometry
Contains             Tests if the given geometry contains another given geometry
Overlap              Returns true if the interiors of two geometries have non-empty
                     intersection

Spatial Analysis
Distance             Returns the shortest distance between two geometries
Buffer               Returns a geometry that consists of all points whose distance from the
                     given geometry is less than or equal to the specified distance
ConvexHull           Returns the smallest convex set enclosing the geometry
Intersection         Returns the geometric intersection of two geometries
Union                Returns the geometric union of two geometries
Difference           Returns the portion of a geometry which does not intersect with
                     another given geometry
SymmDiff             Returns the portions of two geometries which do not intersect with
                     each other

Table 3: A sample of operations listed in the OGIS standard for SQL
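
Two of the operations in Table 3, Envelope and ConvexHull, are simple enough to sketch for a geometry given as a set of points. The code below is a minimal illustration, not the OGIS reference implementation: the function names, the tuple-based representation, and the choice of Andrew's monotone chain algorithm for the hull are all assumptions of this example.

```python
def envelope(points):
    """Minimum orthogonal bounding rectangle as (min_x, min_y, max_x, max_y)."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))


def _cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Convex hull vertices in counter-clockwise order (Andrew's monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        # Drop points that would make a clockwise or straight turn.
        while len(lower) >= 2 and _cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and _cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def envelopes_disjoint(a, b):
    """True if two envelopes share no point; a cheap filter before exact tests."""
    return a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1]


PARCEL = [(0, 0), (2, 0), (2, 2), (0, 2), (1, 1)]
print(envelope(PARCEL))
print(convex_hull(PARCEL))
```

The envelope test is the kind of inexpensive check that can precede an exact topological predicate such as Disjoint: if the two rectangles do not meet, the geometries inside them cannot meet either.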

In  addition  to  defining  the  spatial  data  types,  OGIS  also  defines  spatial  operations.  Table  3  lists  basic 
operations operative on all spatial data types. The topological operations are based on the ubiquitous
nine-intersection model. Using the OGIS specification, common spatial queries can be intuitively posed in SQL.
For example, the query “Find all lakes which have an area greater than 20 sq. km. and are within 50 km.
from the campgrounds” can be posed as shown in Table 4 and Figure 6. Other example GIS and LBS
queries  are  provided  in  Table  5.  The  OGIS  specification  is  confined  to  topological  and  metric  operations  on 
vector  data  types.  Also,  several  spatio-temporal  query  languages  have  been  studied  that  are  trigger  based 
for  relational-oriented  models  [12],  moving  objects  [15],  future  temporal  languages  [10],  and  constraint-based 
query  languages  [13]. 
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Figure  5:  Hierarchy  of  Data  Types 

SELECT L.name
FROM Lake L, Facilities Fa
WHERE Area(L.Geometry) > 20 AND
      Fa.name = 'campground' AND
      Distance(Fa.Geometry, L.Geometry) < 50

Table  4:  SQL  Query  with  spatial  operators 

For  spatial  networks,  commonly  used  spatial  data  types  include  objects  such  as  Node,  Edge,  and  Graph. 
They  may  be  constructed  as  an  ADT  in  a  database  system.  Query  languages  based  on  relational  algebra  are 
unable  to  express  certain  important  graph  queries  without  making  certain  assumptions  about  the  graphs. 
For example, the transitive closure of a graph cannot be determined using relational algebra. In SQL3,
a recursion operation, RECURSIVE, has been proposed to handle the transitive closure operation.
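As an illustration of the recursion mentioned above, the following minimal sketch computes the nodes reachable from a source node with a recursive common table expression. It assumes SQLite (through Python's standard sqlite3 module) as a stand-in for an SQL3-capable system; the table name edge, its columns, and the sample graph are hypothetical.

```python
import sqlite3

def reachable_from(edges, source):
    """Return all nodes reachable from `source` over directed edges."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE edge (src TEXT, dst TEXT)")  # hypothetical schema
    conn.executemany("INSERT INTO edge VALUES (?, ?)", edges)
    rows = conn.execute(
        """
        WITH RECURSIVE reach(node) AS (
            SELECT dst FROM edge WHERE src = ?
            UNION                      -- UNION (not UNION ALL) stops on cycles
            SELECT e.dst FROM edge e JOIN reach r ON e.src = r.node
        )
        SELECT node FROM reach ORDER BY node
        """,
        (source,),
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]

# Hypothetical network: A -> B -> C -> D, plus a cycle D -> B.
EDGES = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "B")]
```

Calling reachable_from(EDGES, "A") lists every node on some path starting at A. The number of join steps is not fixed in advance, which is the part that plain relational algebra cannot express.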

Figure 6: SQL query tree

4.2 Research Needs

Map algebra

Map Algebra [43] is a framework for raster analysis that has evolved into a preeminent language
for dealing with field-based models. Its operations take multiple data layers that are overlaid upon
each other and create a new layer. Some common groups of operations include local, focal,
and zonal. However, research is needed to generalize Map Algebra to temporal or higher dimensional
datasets (e.g., 3D data).
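The three groups of Map Algebra operations can be sketched on small rasters stored as nested lists. This is only an illustration under simplifying assumptions: the function names are hypothetical, the focal neighborhood is a 3x3 window clipped at the raster edge, and all layers are assumed to share the same grid.

```python
def local_add(a, b):
    """Local operation: combine two layers cell by cell."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]

def focal_mean(layer):
    """Focal operation: mean over each cell's 3x3 neighborhood (clipped at edges)."""
    rows, cols = len(layer), len(layer[0])
    out = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            window = [layer[i][j]
                      for i in range(max(0, r - 1), min(rows, r + 2))
                      for j in range(max(0, c - 1), min(cols, c + 2))]
            out_row.append(sum(window) / len(window))
        out.append(out_row)
    return out

def zonal_sum(values, zones):
    """Zonal operation: sum the value layer within each zone of the zone layer."""
    totals = {}
    for value_row, zone_row in zip(values, zones):
        for v, z in zip(value_row, zone_row):
            totals[z] = totals.get(z, 0) + v
    return totals
```

A local operation looks only at the same cell in each layer, a focal operation at a neighborhood around the cell, and a zonal operation at all cells that share a zone identifier in a second layer.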


GIS Queries

Grouping          Recode all land with silty soil to silt-loam soil
Isolate           Select all land owned by Steve Steiner
Classify          If the population density is less than 100 people / sq. mi., land is acceptable
Scale             Change all measurements to the metric system
Rank              If the road is an Interstate, assign it code 1; if the road is a state or US highway,
                  assign it code 2; otherwise assign it code 3
Evaluate          If the road code is 1, then assign it Interstate; if the road code is 2, then assign it
                  Main Artery; if the road code is 3, assign it Local Road
Rescale           Apply a function to the population density
Attribute Join    Join the Forest layer with the layer containing forest-cover codes
Zonal             Produce a new map showing state populations given county population
Registration      Align two layers to a common grid reference
Spatial Join      Overlay the land-use and vegetation layers to produce a new layer

LBS Queries

Nearest Neighbor  List the nearest gas stations
Directions        Display directions from a source to a destination (e.g. Google Maps, Map Quest)
Local Search      Search for restaurants in the neighborhood (e.g. Microsoft Live Local, Google Local)

Table 5: Typical Spatial Queries from GIS and LBS


Modeling  3D  Data 

The representation of volumetric data is another field to be researched. Geographic attributes such as
clouds, emissions, vegetation, etc. are best described as point fields on volumetric bounds. Sensor
technologies such as LADAR (Laser Detection and Ranging), 3D SAR (Synthetic Aperture Radar),
and EM collect data volumetrically. Since volumetric data is huge, the current convention is to translate the
data into lower dimensional representations such as B-reps, point clouds, NURBS, etc. This results in a loss of
intrinsic 3-dimensional information. Efforts [77] have been made to develop three dimensional data models
that emphasize the significance of the volumetric shapes of physical world objects. This topological 3D data
model relies on Poincaré algebra. The internal structure is based on a network of simplexes, and the internal
data structure used is a Tetrahedronized Irregular Network (TEN) [26, 77], which is the three-dimensional
variant of the well-known Triangulated Irregular Network (TIN).


Modeling Spatial Temporal Networks

Graphs have been extensively used to represent spatial networks.
Considering  the  time-dependence  of  the  network  parameters  and  their  topology,  it  has  become  critically 
important  to  incorporate  the  temporal  nature  of  these  networks  into  their  models  to  make  them  more 
accurate  and  effective.  For  example,  in  a  transportation  network  the  travel  times  on  road  segments  are 
often  dependent  on  the  time  of  the  day  and  there  can  be  intervals  when  certain  road  segments  are  not 
available for service. In such time-dependent networks, modeling the time variance becomes very important.
Time  expanded  graphs  [82]  and  time  aggregated  graphs  [57]  have  been  used  to  model  time  varying  spatial 
networks.  In  the  time  expanded  representation,  a  copy  of  the  entire  network  is  maintained  for  every  time 
instant,  whereas  the  time  aggregated  graphs  maintain  a  time  series  of  attributes,  associated  to  every  node 
and  edge. 
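The difference between the two representations can be sketched as follows. The sketch assumes discrete time instants, stores only edge travel times in the time-aggregated graph, uses None for an instant at which an edge is unavailable, and does not model waiting at nodes; the names TAG, expand, and arrival_time are hypothetical.

```python
# Time-aggregated graph: one entry per edge, holding a time series of travel times.
# None marks a time instant at which the edge is not available for service.
TAG = {
    ("A", "B"): [1, 1, 2],
    ("B", "C"): [2, None, 1],
}

def expand(tag, horizon):
    """Derive time-expanded edges ((u, t), (v, t + cost)) from a time-aggregated graph.

    Every node is implicitly copied once per time instant, so the expanded
    form grows with the number of instants.
    """
    expanded = []
    for (u, v), series in tag.items():
        for t, cost in enumerate(series):
            if cost is not None and t + cost <= horizon:
                expanded.append(((u, t), (v, t + cost)))
    return expanded

def arrival_time(tag, path, start):
    """Traverse `path` from time `start`; return the arrival time, or None if blocked."""
    t = start
    for u, v in zip(path, path[1:]):
        series = tag[(u, v)]
        if t >= len(series) or series[t] is None:
            return None
        t += series[t]
    return t
```

In this example, leaving A at time 0 reaches B at time 1, when edge (B, C) is unavailable, whereas leaving at time 1 arrives at C at time 3. The time-aggregated form keeps two edge records, while the expanded form needs one edge per usable instant.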

Network  modeling  can  be  further  extended  to  consider  3D  spatial  data.  Standard  road  network  features 
do  not  represent  3D  structure  and  material  properties.  For  instance,  while  modeling  a  road  tunnel,  we 
might  want  to  represent  its  overpass  clearance  as  a  spatial  property.  Such  properties  will  help  take  spatial 
constraints  into  account  while  selecting  routes. 

Modeling Moving Objects

A moving object database is considered to be a spatio-temporal database in
which the spatial objects may change their position and extent over a period of time. Examples where a
moving objects database may be considered include the movement of taxi cabs, the path of a hurricane over
a period of time, and the geographic profiling of serial criminals. A data model to support the design of
such databases is provided in [62, 47].

Markup Languages

The goals of markup languages, such as Geography Markup Language (GML) [58],
are to provide a standard modeling language and data exchange format for geographic data. GML is an
XML-based markup language to represent geographic entities and the relationships between them. Entities
associated with geospatial data such as geometry, coordinate systems, attributes, etc. can be represented in
a standard way using GML. CityGML [16] is a subclass of GML useful for representing 3D urban objects,
such as buildings, bridges, tunnels, etc. CityGML allows modeling of spatial data at different levels of detail
regarding both geometry and thematic differentiation. It can be used to model 2.5D data (e.g., digital terrain
model), as well as 3D data (walkable architecture model). Keyhole Markup Language (KML) [81] is another
XML-based markup language popular with commercial spatial software from Google. Based on a structure
similar to GML, KML allows representation of points, polygons, 3D objects, attributes, etc.

Figure 7: Entity relationship diagrams for common representations of spatial data

5  Spatial  Query  Processing 

5.1  Accomplishments 

The efficient processing of spatial queries requires both efficient representation and efficient algorithms.
Common representations of spatial data in an object model include spaghetti, the node-arc-area (NAA)
model, the doubly-connected-edge-list (DCEL), and boundary representation, some of which are shown in
Figure 7 using entity-relationship diagrams. The NAA model differentiates between the topological concepts
(node, arc, areas) and the embedding space (points, lines, areas). The spaghetti-ring and DCEL focus on the
topological concepts. The representation of the field data model includes a regular tessellation (triangular,
square, hexagonal grid), as well as triangulated irregular networks (TIN).

Query  processing  in  spatial  databases  differs  from  that  of  relational  databases  because  of  the  following 
three  major  issues: 

•  Unlike  relational  databases,  spatial  databases  have  no  fixed  set  of  operators  that  serve  as  building 
blocks  for  query  evaluation. 


•  Spatial databases deal with extremely large volumes of complex objects. These objects have spatial
extensions and cannot be naturally sorted in a one-dimensional array.

•  Computationally expensive algorithms are required to test for spatial predicates, and the assumption
that I/O costs dominate processing costs in the CPU is no longer valid.


In  this  section,  we  describe  the  processing  techniques  for  evaluating  queries  on  spatial  databases,  and 
discuss  open  problems  in  spatial  query  processing  and  query  optimization. 

5.1.1  Spatial  Query  Operations 

Spatial  query  operations  can  be  classified  into  four  groups  [56]. 

•  Update  Operations:  These  include  standard  database  operations  such  as  modify,  create,  delete. 

•  Spatial  Selection:  These  can  be  of  two  types: 

—  Point  Query:  Given  a  query  point,  find  all  spatial  objects  that  contain  it.  An  example  is  the 
following  query,  “Find  all  river  flood-plains  which  contain  the  SHRINE.” 

—  Regional  Query:  Given  a  query  polygon,  find  all  spatial  objects  which  intersect  the  query 
polygon.  When  the  query  polygon  is  a  rectangle,  this  query  is  called  a  window  query.  These 
queries  are  sometimes  also  referred  to  as  range  queries.  An  example  query  could  be  “Identify  the 
names  of  all  forest  stands  that  intersect  a  given  window.” 

•  Spatial Join: Like the join operator in relational databases, the spatial join is one of the more
important operators. When two tables are joined on a spatial attribute, the join is called a spatial
join. A variant of the spatial join and an important operator in GIS is the map overlay. This
operation combines two sets of spatial objects to form new ones. The “boundaries” of a set of these
new objects are determined by the non-spatial attributes assigned by the overlay operation. For
example, if the operation assigns the same value of the non-spatial attribute to two neighboring
objects, then the objects are “merged”. Some examples of spatial join predicates are intersect,
contains, is-enclosed-by, distance, northwest, adjacent, meets, overlap. A query example of a
spatial join is “Find all forest-stands and river flood-plains which overlap”.

•  Spatial Aggregate: An example of a spatial aggregate is “Find the river closest to a
campground”. Spatial aggregates are usually variants of the Nearest Neighbor [70, 116, 109] search
problem: given a query object, find the object having minimum distance from the query object.
A Reverse Nearest Neighbor (RNN) [83, 78, 139, 104, 156] query is another example of a spatial
aggregate. Given a query object, an RNN query finds objects for which the query object is the
nearest neighbor. Applications of RNN include army strategic planning where a medical unit, A,
in the battlefield is always in search of a wounded soldier for whom A is the nearest medical unit.
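The two aggregate queries can be contrasted with a brute-force sketch. It assumes point objects and Euclidean distance, ignores indexing entirely, and uses hypothetical function names and sample coordinates; the algorithms cited above avoid this exhaustive scan.

```python
import math

def nearest(query, candidates):
    """Nearest Neighbor: the candidate with minimum Euclidean distance to `query`."""
    return min(candidates, key=lambda p: math.dist(query, p))

def reverse_nearest(query, others, sites):
    """Reverse Nearest Neighbor: the points in `others` whose nearest site,
    among `sites` plus the query itself, is the query."""
    all_sites = list(sites) + [query]
    return [p for p in others if nearest(p, all_sites) == query]

# Hypothetical battlefield: existing medical units, a unit A, and wounded soldiers.
UNITS = [(0, 0), (10, 0)]
UNIT_A = (5, 5)
SOLDIERS = [(1, 1), (5, 4), (9, 1)]
```

Here reverse_nearest(UNIT_A, SOLDIERS, UNITS) returns the soldiers for whom unit A is the closest medical unit, which is the question posed in the army planning example.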

5.1.2  Visibility  Queries 

Visibility  has  been  widely  studied  in  Computer  Graphics.  Visibility  may  be  defined  as  the  parts  of  objects 
and  the  environment  that  are  visible  from  a  point  in  space.  A  visibility  query  can  be  thought  of  as  a  query 
that  returns  the  objects  and  part  of  the  environment  visible  at  the  querying  point.  For  example,  within 
a  city,  if  the  coverage  area  of  a  wireless  antenna  is  considered  to  be  the  visible  area,  then  the  union  of 
coverage  areas  of  all  the  antennas  in  the  city  will  provide  an  idea  about  the  area  which  is  not  covered.  Such 
information  may  be  used  to  strategically  place  a  new  antenna  at  an  optimal  location.  In  a  visibility  query, 
if  the  point  in  space  moves,  the  area  of  visibility  changes.  Such  a  query  may  be  called  a  continuous  visibility 
query.  For  example,  security  for  the  President’s  motorcade  involves  cordoning  off  the  buildings  which  have 
route  visibility.  In  such  a  case,  the  visibility  query  may  be  thought  of  as  a  query  that  returns  the  buildings 
visible  at  different  points  on  the  route. 

5.1.3  Visual  Querying 

Many spatial applications present results visually, in the form of maps which consist of graphic images, 3D
displays, and animations. These applications allow users to query the visual representation by pointing to
it with devices such as a mouse or a pen. Such graphical interfaces allow users to query spatial data
without writing any SQL statements. In recent years, map services, such as Google Earth and Microsoft
Virtual Earth, have become very popular. Further work is needed to
explore the impact of querying by pointing and visual presentation of results on database performance.

5.1.4  Two-Step  Query  Processing  of  Spatial  Operations 

Spatial query processing involves complex data types; for example, a lake boundary might need a thousand
vertices for exact representation. Spatial operations therefore typically follow a two-step algorithm (filter
and refinement), as shown in Figure 8, to efficiently process complex spatial objects [37]. Approximate
geometry, such as the minimal orthogonal bounding rectangle of an extended spatial object, is first used to
filter out many irrelevant objects quickly. Exact geometry is then used for the remaining spatial objects to
complete the processing.

•  Filter step: In this step, the spatial objects are represented by simpler approximations like the
minimum bounding rectangle (MBR). For example, consider the following point query, “Find all rivers
whose flood-plains overlap the SHRINE”. In SQL this will be:


SELECT river.name
FROM river
WHERE overlap(river.flood-plain, :SHRINE)

If we approximate the flood-plains of all rivers with MBRs, then it is less expensive to determine
whether the point is in an MBR than to check if a point is in an irregular polygon, that is, in the exact
shape of the flood-plain. The answer from this approximate test is a superset of the real answer set.
This superset is sometimes called the candidate set. Even the spatial predicate may be replaced by an
approximation to simplify a query optimizer. For example, touch(river.flood-plain, :SHRINE) may be
replaced by overlap(MBR(river.flood-plain), MBR(:SHRINE)) in the filter step. Many
spatial operators, for example, inside, north-of and buffer, can be approximated using the overlap
relationship among corresponding MBRs. Such a transformation guarantees that no tuple from the
final answer using exact geometry is eliminated in the filter step.

•  Refinement step: Here, the exact geometry of each element from the candidate set and the exact
spatial predicate are examined. This usually requires the use of a CPU-intensive algorithm. This step
may  sometimes  be  processed  outside  the  spatial  database  in  an  application  program  such  as  GIS,  using 
the  candidate  set  produced  by  the  spatial  database  in  the  filter  step. 
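The two steps can be sketched for the point query above. The sketch assumes flood-plains are simple polygons given as vertex lists, uses a ray-casting test as one possible exact predicate for the refinement step, and uses hypothetical names and sample data throughout.

```python
def mbr(polygon):
    """Minimum bounding rectangle (xmin, ymin, xmax, ymax) of a vertex list."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)

def in_mbr(point, box):
    """Cheap approximate test used in the filter step."""
    x, y = point
    xmin, ymin, xmax, ymax = box
    return xmin <= x <= xmax and ymin <= y <= ymax

def in_polygon(point, polygon):
    """Exact test used in the refinement step: ray casting (even-odd rule)."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge crosses the horizontal line through the point
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def point_query(point, polygons):
    """Return (candidate set from the filter step, final answer after refinement)."""
    candidates = [name for name, poly in polygons.items() if in_mbr(point, mbr(poly))]
    answers = [name for name in candidates if in_polygon(point, polygons[name])]
    return candidates, answers

# Hypothetical flood-plains; the SHRINE is at (8, 8).
FLOOD_PLAINS = {
    "r1": [(0, 0), (10, 0), (0, 10)],             # triangle: MBR contains (8, 8), shape does not
    "r2": [(20, 20), (30, 20), (30, 30), (20, 30)],
    "r3": [(4, 4), (12, 4), (12, 12), (4, 12)],
}
```

For the sample data the candidate set is a strict superset of the answer: r1 survives the MBR test but is rejected by the exact test. In practice the MBRs would be precomputed and stored in the index rather than recomputed per query.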

5.1.5  Techniques  for  Spatial  Operations 

This section presents several common operations between spatial objects: selection, spatial join, aggregates,
and bulk loading.

Selection Operation

Similar to traditional database systems, the selection operation can be performed on indexed or
non-indexed spatial data. The difference is in the technique used to evaluate the predicate and the type of
index. As discussed in the previous section, a two-step approach, where the geometry of a spatial object
is approximated by a rectangle, is commonly used to evaluate a predicate. Popular indexing techniques
for spatial data are the R-tree and space-filling curves. An R-tree is a height-balanced tree that is a natural
extension of a B-tree for k-dimensions. It allows a point search to be processed in O(log n) time.
Space-filling curves provide one-to-one continuous mappings which map points of multi-dimensional space into
one-dimensional space. This allows the user to impose order on higher dimensional spaces. Common examples
of space-filling curves are row-order Peano, Z-order, and Hilbert curves. Once the data has been ordered
by a space-filling curve, a B-tree index can be imposed on the ordered entries to enhance the search. Point
search operations can be performed in O(log n) time.
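The combination of a space-filling curve with a one-dimensional index can be sketched with the Z-order. The sketch assumes non-negative integer coordinates and uses a sorted Python list with binary search as a stand-in for a B-tree; the function names are hypothetical.

```python
import bisect

def z_value(x, y, bits=16):
    """Z-order value: interleave the bits of x (even positions) and y (odd positions)."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z

def build_index(points):
    """Order points by Z-value; the sorted list stands in for a B-tree."""
    return sorted((z_value(x, y), (x, y)) for x, y in points)

def point_search(index, x, y):
    """Binary search on the Z-value: O(log n) comparisons."""
    key = (z_value(x, y), (x, y))
    i = bisect.bisect_left(index, key)
    return i < len(index) and index[i] == key
```

Because the mapping is one-to-one, an exact point lookup needs no refinement. Range queries are less direct, since a rectangle in space generally corresponds to several disjoint intervals of Z-values.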

Spatial  Join  Operation 

Conceptually  a  join  is  defined  as  a  cross-product  followed  by  a  selection  condition.  In  practice,  this 
viewpoint  can  be  very  expensive,  because  it  involves  materializing  the  cross-product  before  applying  the 
selection  criterion.  This  is  especially  true  for  spatial  databases.  Many  ingenious  algorithms  have  been 
proposed  to  preempt  the  need  to  perform  the  cross-product.  The  two-step  query  processing  technique 
described  in  the  previous  section  is  the  most  commonly  used.  With  such  methods,  the  spatial  join  operation 
can  be  reduced  to  a  rectangle-rectangle  intersection,  the  cost  of  which  is  relatively  modest  compared  to  the 
I/O  cost  of  retrieving  pages  from  secondary  memory  for  processing. 

A  number  of  strategies  have  been  proposed  for  processing  spatial  joins.  Interested  readers  are  encouraged 
to  refer  to  [27,  97,  90,  130,  159]. 
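As one illustration of avoiding the cross-product, the filter step of a spatial join can be sketched as a sweep over MBRs sorted by their lower x-coordinate. This is a simplified sketch and not one of the specific algorithms cited above; rectangles are (xmin, ymin, xmax, ymax) tuples, and the names and sample data are hypothetical.

```python
def overlaps(a, b):
    """True if rectangles (xmin, ymin, xmax, ymax) share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def mbr_join(left, right):
    """Filter step of a spatial join: (left_id, right_id) pairs whose MBRs overlap.

    Rectangles from both inputs are visited in order of xmin. Each one is
    compared only with rectangles of the other input that are still "open"
    at the sweep line, instead of with every rectangle.
    """
    events = sorted([(box[0], 0, key, box) for key, box in left.items()] +
                    [(box[0], 1, key, box) for key, box in right.items()])
    active = ([], [])  # open rectangles of the left and right input
    pairs = []
    for xmin, side, key, box in events:
        other = active[1 - side]
        # Rectangles ending before the sweep line cannot overlap anything later.
        other[:] = [(k, b) for k, b in other if b[2] >= xmin]
        for k, b in other:
            if overlaps(box, b):
                pairs.append((key, k) if side == 0 else (k, key))
        active[side].append((key, box))
    return sorted(pairs)

# Hypothetical MBRs of forest stands and river flood-plains.
FORESTS = {"F1": (0, 0, 4, 4), "F2": (10, 10, 12, 12)}
PLAINS = {"P1": (3, 3, 6, 6), "P2": (5, 0, 7, 2), "P3": (11, 11, 15, 15)}
```

The pairs returned are only candidates; a refinement step on the exact geometries would still be needed to answer a query such as “Find all forest-stands and river flood-plains which overlap”.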

Aggregate  Operation:  Nearest  Neighbor,  Reverse  Nearest  Neighbor 

Nearest  Neighbor  queries  are  common  in  many  applications.  For  example,  a  person  driving  on  the  road 
may  want  to  find  the  nearest  gas  station  from  his  current  location.  Various  algorithms  exist  for  nearest 
neighbor queries [70, 116, 109, 73, 157]. Techniques based on Voronoi diagrams, Quad-tree indexing, and
Kd-trees have been discussed in [119].

Reverse Nearest Neighbor queries were introduced in [83] in the context of decision support systems. For
example, an RNN query can be used to find a set of customers who can be influenced by the opening of a
new store outlet location.

5.1.6 Bulk Loading


Bulk operations potentially affect a large set of tuples, unlike other database operations, such as insert into
a relation, which affects possibly one tuple at a time. Bulk loading refers to the creation of an index from
scratch on a potentially large set of data. Bulk loading has its advantages because the properties of the data
set may be known in advance. These properties may be used to efficiently design the space-partitioning index
structures commonly used for spatial data. An evaluation of generic bulk loading techniques is provided in
[45].

5.1.7  Parallel  GIS 

A  High  Performance  Geographic  Information  System  (HPGIS)  is  a  central  component  of  many  interactive 
applications  like  real-time  terrain  visualization,  situation  assessment,  and  spatial  decision-making.  The 
Geographic  Information  System  (GIS)  often  contains  large  amounts  of  geometric  and  feature  data  (e.g. 
location,  elevation,  soil  type,  etc.)  represented  as  large  sets  of  points,  chains  of  line  segments,  and  polygons. 
This  data  is  often  accessed  via  range  queries.  The  existing  sequential  methods  for  supporting  GIS  operations 
do not meet the real-time requirements imposed by many interactive applications.

Hence, parallelization of GIS is essential for meeting the high performance requirements of several
real-time applications. A GIS operation can be parallelized either by function-partitioning [20, 22, 137] or
by data-partitioning [28, 38, 55, 71, 72, 87, 152, 158, 126]. Function-partitioning uses specialized data
structures  (e.g.  distributed  data  structures)  and  algorithms  which  may  be  different  from  their  sequential 
counterparts.  Data-partitioning  techniques  divide  the  data  among  different  processors  and  independently 
execute  the  sequential  algorithm  on  each  processor.  Data-partitioning  in  turn  is  achieved  by  declustering 
[51,  94]  the  spatial  data.  If  the  static  declustering  methods  fail  to  equally  distribute  the  load  among  different 
processors,  the  load-balance  may  be  improved  by  redistributing  parts  of  the  data  to  idle  processors  using 
Dynamic  Load-Balancing  (DLB)  techniques. 

5.2  Research  Needs 

This  section  presents  the  research  needs  for  spatial  query  processing  and  query  optimization. 

Query  processing 

Many  open  research  areas  exist  at  the  logical  level  of  query  processing,  including  query-cost  modeling 
and  queries  related  to  fields  and  networks.  Cost  models  are  used  to  rank  and  select  the  promising  processing 
strategies, given a spatial query and a spatial data set. However, traditional cost models may not be accurate
in estimating the cost of strategies for spatial operations, due to the distance metric as well as the semantic
gap between relational operators and spatial operations. Comparing the execution costs of such strategies
requires that new cost models be developed to estimate the selectivity of spatial search and join operations.
Preliminary work in the context of the R-tree, tree-matching join, and fractal-model is promising [33, 147],
but more work is needed.

Many  processing  strategies  using  the  overlap  predicate  have  been  developed  for  range  queries  and  spatial 
join  queries.  However,  there  is  a  need  to  develop  and  evaluate  strategies  for  many  other  frequent  queries 
such  as  those  listed  in  Table  6.  These  include  queries  on  objects  using  predicates  other  than  overlap,  queries 
on  fields  such  as  slope  analysis,  and  queries  on  networks  such  as  the  shortest  path  to  a  set  of  destinations. 

Depending  on  the  type  of  spatial  data  and  the  nature  of  the  query,  other  research  areas  also  need  to  be 
investigated.  A  moving  objects  query  involves  spatial  objects  that  are  mobile.  Examples  of  such  queries 
include “Which is the nearest taxi cab to the customer?”, “Where is the hurricane expected to hit next?”,
and “What is a possible location of a serial criminal?” With the increasing availability of streaming data
from GPS devices, continuous queries have become an active area of research. Several techniques [49, 61, 62]
have been proposed to execute such queries.

A  skyline  query  [36]  is  a  query  to  retrieve  a  set  of  interesting  points  (records)  from  a  potentially  huge 
collection  of  points  (records)  based  on  certain  attributes.  For  example,  considering  a  set  of  hotels  to  be 
points,  the  skyline  query  may  return  a  set  of  interesting  hotels  based  on  a  user’s  preferences.  The  set  of  hotels 
returned for a user who prefers cheap hotels may be different from the set of hotels returned for a user who
prefers hotels which are closer to the coast. Research needed for the skyline query operation includes the
design of efficient algorithms and processing for higher dimensions (attributes). Other query processing
techniques where research is required are querying on 3D spatial data and spatio-temporal data.
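A skyline query can be sketched under the commonly used dominance-based definition, which is an assumption here since the text above describes the query only informally: a point belongs to the skyline if no other point is at least as good in every attribute and strictly better in at least one. The sketch assumes smaller values are better for all attributes and uses a quadratic scan; the names and hotel data are hypothetical.

```python
def dominates(a, b):
    """a dominates b: no worse in every attribute, strictly better in at least one
    (smaller is better for every attribute)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def skyline(points):
    """Points not dominated by any other point (quadratic scan)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical hotels as (price, distance to the coast in km).
HOTELS = [(50, 8), (80, 2), (60, 9), (120, 1), (90, 3)]
```

The skyline keeps both the cheapest hotel and the one closest to the coast, so users with either preference find their best option in the result. The quadratic scan is exactly the cost that motivates research on more efficient algorithms, especially as the number of attributes grows.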

Query  optimization 

The query optimizer, a module in database software, generates different evaluation plans and determines
the appropriate execution strategy. Before the query optimizer can operate on the query, the high level
declarative statement must be scanned through a parser. The parser checks the syntax and transforms
the statement into a query tree. In traditional databases, the data types and functions are fixed and the
parser is relatively simple. Spatial databases are examples of an extensible database system and have
provisions for user-defined types and methods. Therefore, compared to traditional databases, the parser
for spatial databases has to be considerably more sophisticated to identify and manage user-defined data
types and map them into syntactically correct query trees. In the query tree, the leaf nodes correspond to
the relations involved and the internal nodes correspond to the basic operations that constitute the query.
Query processing starts at the leaf nodes and proceeds up the tree until the operation at the root node has
been performed.

Table 6: Difficult Spatial Queries from GIS

Voronoize               Classify households as to which supermarket they are closest to
Network                 Find the shortest path from the warehouse to all delivery stops
Time-dependent network  Find the shortest path where the road network is dynamic
Allocation              Where is the best place to build a new restaurant
Transformation          Triangulate a layer based on elevation
Bulk Load               Load a spatial data file into the database
Raster <-> Vector       Convert between raster and vector representations
Visibility              Find all points of objects and environment visible from a point
Evacuation Route        Find evacuation routes based on capacity and availability constraints
Predict Location        Predict the location of a mobile person based on personal route patterns

Consider  the  query,  “Find  all  lakes  which  have  an  area  greater  than  20  sq.  km.  and  are  within  50  km. 
from  the  campground.”  Let  us  assume  that  the  Area()  function  is  not  pre-computed  and  that  its  value  is 
computed  afresh  every  time  it  is  invoked.  A  query  tree  generated  for  the  query  is  shown  in  Figure  9  (a). 
In  the  classical  situation,  the  rule  “select  before  join”  would  dictate  that  the  Area  function  be  computed 
before  the  join  predicate  function,  Distance() (Figure  9  (b)),  the  underlying  assumption  being  that  the 
computational  cost  of  executing  the  select  and  join  predicate  is  equivalent  and  negligible  compared  to  the 
I/O  cost  of  the  operations.  In  the  spatial  situation,  the  relative  cost  per  tuple  of  Area()  and  Distance() 
is  an  important  factor  in  deciding  the  order  of  the  operations  [69].  Depending  upon  the  implementation  of 
these two functions, the optimal strategy may be to process the join before the select operation (Figure 9 (c)).
This approach thus violates the main heuristic rule for relational databases, “Apply select and
project before the join and binary operations”, which is no longer unconditional. Cost-based optimization
techniques can be used to determine the optimal execution strategy from a set of execution plans. A quantitative analysis
of  spatial  index  structures  is  used  to  calculate  the  expected  number  of  disk  accesses  that  are  required  to 
perform  a  spatial  query  [146].  Nevertheless,  in  spite  of  these  advances,  query  optimization  techniques  for 
spatial  data  need  further  study. 
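The trade-off between the two plans for the lake query can be made concrete with a deliberately simplified cost model. The model is an assumption made for illustration only: it counts CPU cost per function call, ignores I/O and indexes, and treats the per-call costs and selectivities as known inputs. All names and numbers are hypothetical.

```python
def cost_select_first(n_lakes, n_camps, c_area, c_dist, sel_area):
    """Area() on every lake, then Distance() on the surviving lake-campground pairs."""
    return n_lakes * c_area + sel_area * n_lakes * n_camps * c_dist

def cost_join_first(n_lakes, n_camps, c_area, c_dist, sel_dist):
    """Distance() on every lake-campground pair, then Area() on the lakes that matched."""
    return n_lakes * n_camps * c_dist + sel_dist * n_lakes * c_area

def choose_plan(n_lakes, n_camps, c_area, c_dist, sel_area, sel_dist):
    """Pick the cheaper of the two plans under this simplified model."""
    a = cost_select_first(n_lakes, n_camps, c_area, c_dist, sel_area)
    b = cost_join_first(n_lakes, n_camps, c_area, c_dist, sel_dist)
    return "select-first" if a <= b else "join-first"
```

When Area() is far more expensive per tuple than Distance() and the distance predicate is selective, the model prefers the join-first plan; with the costs reversed it falls back to the classical select-first plan.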

6  Spatial  File  Organization  and  Indices 

6.1  Accomplishments 

6.1.1  Space-Filling  Curves 

Figure 9: (a) Query tree (b) “pushing down”: select operation (c) “pushing down” may not help

The physical design of a spatial database optimizes the instructions to storage devices for performing common
operations on spatial data files. File designs for secondary storage include clustering methods as well as
spatial hashing methods. Spatial clustering techniques are more difficult to design than traditional clustering
techniques because there is no natural order in multidimensional space where spatial data resides. This is
further complicated by the fact that the storage disk is logically a one-dimensional device. Thus, what is needed
is a mapping from a higher dimensional space to a one-dimensional space which is distance-preserving and
one-to-one: elements that are close in space are mapped onto nearby points on the line, and no two points in
the space are mapped onto the same point on the line [29]. Several mappings, none of them ideal, have been
proposed to accomplish this. The most prominent ones include row-order, Z-order and the Hilbert-curve
(Figure 10).

Figure 10: Space-filling curves to linearize a multidimensional space (panels: Peano-Hilbert; Morton / Z-order)
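As an illustration of such a mapping, the following sketch computes the position of a grid cell along the Hilbert curve using the well-known iterative bit-manipulation formulation. It assumes an n x n grid with n a power of two and integer cell coordinates; the function name is hypothetical.

```python
def hilbert_index(n, x, y):
    """Map cell (x, y) of an n x n grid (n a power of two) to its position
    along the Hilbert curve."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)   # which quadrant, in curve order
        if ry == 0:                    # rotate/flip so the sub-curve is oriented correctly
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d
```

Consecutive positions on the curve always correspond to adjacent cells, so points that are close on the line are close in space. The converse does not hold in general, which is one reason none of the proposed mappings is ideal.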

Metric  clustering  techniques  use  the  notion  of  distance  to  group  nearest  neighbors  together  in  a  metric 
space. Topological clustering methods like connectivity clustered access methods [124] use the min-cut partitioning
of a graph representation to efficiently support graph traversal operations. The physical organization
of  files  can  be  supplemented  with  indices,  which  are  data  structures  to  improve  the  performance  of  search 
operations. 

Classical one-dimensional indices such as the B+-tree can be used for spatial data by linearizing a
multidimensional space using a space-filling curve such as the Z-order. A large number of spatial indices [119] have
been explored for multidimensional Euclidean space. Representative indices for point objects include grid
files, multidimensional grid files [89], Point-Quad-Trees, and Kd-trees. Representative indices for extended
objects include the R-tree family, the Field-tree, Cell-tree, BSP-tree, and Balanced and Nested grid files.

6.1.2  Grid  Files 

Grid files were introduced by Nievergelt [106]. A grid file divides the space into n-dimensional cells whose contents fit into equal-size buckets. The structure is not hierarchical and can be used to index static, uniformly distributed data. However, because of this structure, the directory of a grid file can be so sparse and large that a large main memory is required. Several variations of grid files have been proposed to index data efficiently and to overcome these limitations [108, 153]. An overview of grid files is given in [119].
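The cell-to-bucket idea can be sketched as follows. This is a deliberately simplified uniform grid with fixed cell boundaries and one bucket per cell; it is an assumption-laden illustration and not the grid file of [106], which maintains a separate directory. The class name `UniformGrid` and its methods are hypothetical.

```python
from collections import defaultdict


class UniformGrid:
    """Non-hierarchical point index: each grid cell owns one bucket."""

    def __init__(self, cell_size: float):
        self.cell_size = cell_size
        self.buckets = defaultdict(list)  # (cell_x, cell_y) -> list of points

    def _cell(self, x: float, y: float):
        return (int(x // self.cell_size), int(y // self.cell_size))

    def insert(self, x: float, y: float) -> None:
        self.buckets[self._cell(x, y)].append((x, y))

    def range_query(self, xmin, ymin, xmax, ymax):
        """Visit only the cells overlapping the query window."""
        cx0, cy0 = self._cell(xmin, ymin)
        cx1, cy1 = self._cell(xmax, ymax)
        hits = []
        for cx in range(cx0, cx1 + 1):
            for cy in range(cy0, cy1 + 1):
                for (x, y) in self.buckets.get((cx, cy), []):
                    if xmin <= x <= xmax and ymin <= y <= ymax:
                        hits.append((x, y))
        return hits


grid = UniformGrid(cell_size=10)
for p in [(1, 1), (12, 3), (25, 25), (9.5, 9.5)]:
    grid.insert(*p)
window_hits = grid.range_query(0, 0, 15, 10)
```

With skewed data most cells would stay empty while a few buckets overflow, which is the sparsity problem noted above.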

6.1.3  Tree  indexes 

The R-tree [64] indexes objects in a hierarchical structure. It is a height-balanced tree that is the natural extension of the B-tree to k dimensions. Spatial objects are represented in the R-tree by their minimum bounding rectangles (MBRs). Figure 11 illustrates spatial objects organized as an R-tree index. R-trees can be used to process both point and range queries.
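The role of MBRs in query processing can be sketched as follows. The tree below is assembled by hand purely for illustration; the insertion and node-splitting algorithms of [64] are not implemented, and the object names p, q, r, s and their coordinates are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def mbr(rects: List[Rect]) -> Rect:
    """Smallest axis-aligned rectangle enclosing all given rectangles."""
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))


def intersects(a: Rect, b: Rect) -> bool:
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


@dataclass
class Node:
    rect: Rect
    children: List["Node"] = field(default_factory=list)
    name: str = ""  # object identifier, set only on leaf entries


def leaf(name: str, rect: Rect) -> Node:
    return Node(rect=rect, name=name)


def group(children: List[Node]) -> Node:
    # An internal node stores the MBR of everything beneath it.
    return Node(rect=mbr([c.rect for c in children]), children=children)


def range_query(node: Node, query: Rect) -> List[str]:
    """Descend only into subtrees whose MBR intersects the query window."""
    if not intersects(node.rect, query):
        return []
    if not node.children:
        return [node.name]
    hits: List[str] = []
    for child in node.children:
        hits.extend(range_query(child, query))
    return hits


root = group([
    group([leaf("p", (0, 0, 2, 2)), leaf("q", (1, 1, 3, 3))]),
    group([leaf("r", (10, 10, 12, 12)), leaf("s", (11, 8, 13, 11))]),
])
```

A point query is the special case of a degenerate window, e.g. `range_query(root, (11.5, 10.5, 11.5, 10.5))`. The result lists objects whose MBR meets the window; an exact geometry check would still be needed afterwards.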

Several variants of the R-tree exist that improve query performance and storage utilization. The R+-tree [123] stores objects while avoiding overlaps among the MBRs, which improves search performance. The R*-tree [31] relies on the combined optimization of the area, margin, and overlap of each MBR in the intermediate nodes of the tree, which results in better storage utilization.

Many R-tree based index structures [145, 117, 150, 105, 118, 144] have been proposed to index spatio-temporal objects. A survey of spatio-temporal access methods is provided in [101].

The quad tree [52] is a space-partitioning index structure in which the space is recursively divided into quadrants. This recursive subdivision continues until each quadrant is homogeneous. There are several variations of quad trees to store point data, raster data, and object data. There are also other quad tree structures to index spatio-temporal datasets, such as Overlapping Linear Quad Trees [149] and Multiple Overlapping Features (MOF) trees [98].
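The recursive subdivision can be sketched for raster data as follows. This is a minimal region quad tree, assuming a square binary raster whose side is a power of two; the function names and the nested-list representation are illustrative choices.

```python
def build_region_quadtree(grid, x=0, y=0, size=None):
    """Return a leaf value if the block is homogeneous, else four subtrees.

    Subtrees are ordered NW, NE, SW, SE; grid[row][col] holds the cell value.
    """
    if size is None:
        size = len(grid)
    first = grid[y][x]
    if all(grid[r][c] == first
           for r in range(y, y + size)
           for c in range(x, x + size)):
        return first  # homogeneous block: stop subdividing
    half = size // 2
    return [
        build_region_quadtree(grid, x, y, half),
        build_region_quadtree(grid, x + half, y, half),
        build_region_quadtree(grid, x, y + half, half),
        build_region_quadtree(grid, x + half, y + half, half),
    ]


def count_leaves(tree) -> int:
    if not isinstance(tree, list):
        return 1
    return sum(count_leaves(child) for child in tree)


raster = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
tree = build_region_quadtree(raster)
```

Only the south-east quadrant of this raster is mixed, so it is the only one that is split further; the sixteen cells are represented by seven leaves.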

The Generalized Search Tree (GiST) [68] provides a framework to build almost any kind of tree index on any kind of data. Tree index structures, such as the B+-tree and R-tree, can be built using GiST. A Spatial-Partitioning Generalized Search Tree (SP-GiST) [25] is an extensible index structure for space-partitioning trees. Index trees such as the quad tree and kd-tree can be built using SP-GiST.
Figure 11: Spatial objects (d, e, f, g, h, i) arranged in an R-tree hierarchy


6.1.4  Graph  indexes 

Most spatial access methods provide methods and operators for point and range queries over collections of spatial points, line segments, and polygons. However, it is not clear whether spatial access methods can efficiently support network computations that traverse line segments in a spatial network based on connectivity rather than geographic proximity. The Connectivity-Clustered Access Method (CCAM) for spatial networks [124] indexes spatial networks based on graph partitioning in order to support network operations. An auxiliary secondary index, such as a B+-tree, R-tree, or grid file, is used to support network operations such as Find(), get-a-Successor(), and get-Successors().

6.2  Research  Needs 

Concurrency Control  The R-link tree [84] is among the few approaches available for concurrency control on the R-tree. It provides concurrency during operations such as search, insert, and delete, and it is also recoverable in a write-ahead logging environment. General algorithms for concurrency control on GiST, which can also be applied to tree-based indexes, are given in [85]. New concurrency-control techniques are needed for other useful spatial data structures.
7  Trends:  Spatial  Data  Mining

7.1  Accomplishments 

The  explosive  growth  of  spatial  data  and  widespread  use  of  spatial  databases  emphasize  the  need  for  the 
automated  discovery  of  spatial  knowledge.  Spatial  data  mining  is  the  process  of  discovering  interesting 
and  previously  unknown,  but  potentially  useful  patterns  from  spatial  databases.  Some  of  the  applications 
are:  location-based  services,  studying  the  effects  of  climate,  land-use  classification,  predicting  the  spread  of 
disease,  creating  high  resolution  three-dimensional  maps  from  satellite  imagery,  finding  crime  hot  spots,  and 
detecting  local  instability  in  traffic.  A  detailed  review  of  spatial  data  mining  can  be  found  in  [134]. 

The  requirements  of  mining  spatial  databases  are  different  from  those  of  mining  classical  relational 
databases.  The  difference  between  classical  and  spatial  data  mining  parallels  the  difference  between  classical 
and  spatial  statistics.  One  of  the  fundamental  assumptions  that  guides  statistical  analysis  is  that  the  data 
samples  are  independently  generated,  as  with  successive  tosses  of  a  coin,  or  the  rolling  of  a  die.  When  it 
comes  to  the  analysis  of  spatial  data,  the  assumption  about  the  independence  of  samples  is  generally  false. 
In fact, spatial data tends to be highly self-correlated. For example, natural resources, wildlife, and temperature vary gradually over space. The notion of spatial autocorrelation, the idea that similar objects tend to cluster in geographic space, is unique to spatial data mining.
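One standard way to quantify spatial autocorrelation is Moran's I, a statistic from the spatial statistics literature that this section does not itself define. The sketch below assumes a binary neighborhood matrix over six locations arranged in a line; the data values are made up for illustration.

```python
def morans_i(values, weights):
    """Moran's I for `values` under the neighborhood matrix `weights`."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    s0 = sum(sum(row) for row in weights)  # total weight
    num = sum(weights[i][j] * dev[i] * dev[j]
              for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    return (n / s0) * (num / den)


def chain_weights(n):
    """Binary contiguity matrix: location i neighbors i-1 and i+1."""
    return [[1 if abs(i - j) == 1 else 0 for j in range(n)] for i in range(n)]


w = chain_weights(6)
gradual = morans_i([1, 2, 3, 4, 5, 6], w)      # smoothly varying attribute
alternating = morans_i([1, 6, 1, 6, 1, 6], w)  # neighbors always differ
```

The gradually varying series yields a positive value, matching the observation that nearby locations tend to be similar, while the alternating series yields a negative one.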

For a detailed discussion of spatial analysis, readers are encouraged to refer to [18, 66].

7.1.1  Spatial  Patterns 

This section presents several spatial patterns, specifically those related to location prediction, Markov random fields, spatial clustering, spatial outliers, and spatial co-location.

Location  Prediction 

Location prediction is concerned with the discovery of a model to infer locations of a spatial phenomenon from the maps of other spatial features. For example, ecologists build models to predict habitats for endangered species using maps of vegetation, water bodies, climate, and other related species. Figure 12 shows the learning dataset used in building a location prediction model for red-winged blackbirds in the Darr and Stubble wetlands on the shores of Lake Erie in Ohio. The dataset consists of nest location, vegetation durability, distance to open water, and water depth maps. Spatial data mining techniques that capture the spatial autocorrelation [76, 136] of nest location, such as the Spatial Autoregression Model (SAR) and Markov Random Fields (MRF), are used for location prediction modeling.

Spatial  Autoregression  Model 

Figure 12: (a) Learning dataset: the geometry of the Darr wetland and the locations of the nests, (b) the spatial distribution of vegetation durability over the marshland, (c) the spatial distribution of water depth, and (d) the spatial distribution of distance to open water.

Linear regression models are used to estimate the conditional expected value of a dependent variable y given the values of other variables X. Such a model assumes that the observations are independent of one another. The Spatial Autoregression Model [18, 60, 92, 127] is an extension of the linear regression model that takes spatial autocorrelation into consideration. If the values of the dependent variable y at neighboring locations are related to each other, then the regression equation [23] can be modified as

$$y = \rho W y + X\beta + \epsilon \tag{1}$$

Here W is the neighborhood relationship contiguity matrix and ρ is a parameter that reflects the strength of the spatial dependencies between the elements of the dependent variable. Notice that when ρ = 0, this equation collapses to the linear regression model. If the spatial autocorrelation coefficient is statistically significant, then SAR will quantify the presence of spatial autocorrelation. In such a case, the spatial autocorrelation coefficient will indicate the extent to which variations in the dependent variable (y) are explained by the average of neighboring observation values.
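Equation (1) defines y implicitly, since y appears on both sides. As a short worked step, and assuming that the matrix I − ρW is invertible (an assumption not stated in the text above), the equation can be solved for y:

```latex
\begin{align*}
y &= \rho W y + X\beta + \epsilon \\
(I - \rho W)\, y &= X\beta + \epsilon \\
y &= (I - \rho W)^{-1} (X\beta + \epsilon)
\end{align*}
```

Setting ρ = 0 turns the inverse into the identity matrix, which recovers the ordinary linear regression form y = Xβ + ε.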

Markov  Random  Field 

Markov Random Field-based [93] Bayesian classifiers estimate the classification model, f_C, using MRF and Bayes’ rule. A set of random variables whose interdependency relationship is represented by an undirected graph (i.e., a symmetric neighborhood matrix) is called a Markov Random Field. The Markov property specifies that a variable depends only on its neighbors and is independent of all other variables. The location prediction problem can be modeled in this framework by assuming that the class labels, l_i = f_C(s_i), of different locations, s_i, constitute an MRF. In other words, random variable l_i is independent of l_j if W(s_i, s_j) = 0.

The Bayesian rule can be used to predict l_i from feature value vector X and neighborhood class label vector L_i as follows:

$$\Pr(l_i \mid X, L_i) = \frac{\Pr(X \mid l_i, L_i)\,\Pr(l_i \mid L_i)}{\Pr(X)} \tag{2}$$


The solution procedure can estimate Pr(l_i | L_i) from the training data, where L_i denotes a set of labels in the neighborhood of s_i excluding the label at s_i. It does this by examining the ratios of the frequencies of class labels to the total number of locations in the spatial framework. Pr(X | l_i, L_i) can be estimated using kernel functions from the observed values in the training dataset.
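Once the two terms in the numerator have been estimated, applying Bayes’ rule is a small computation. The sketch below assumes that both estimates are already available as dictionaries keyed by class label and that the denominator is obtained by summing the numerator over all labels; the label names and probability values are hypothetical.

```python
def mrf_posterior(likelihood, prior):
    """Combine Pr(X | l, L) and Pr(l | L) into a posterior over labels l.

    likelihood[l] : estimated Pr(X | l, L) for the observed features X
    prior[l]      : estimated Pr(l | L) given the neighborhood labels L
    """
    joint = {label: likelihood[label] * prior[label] for label in prior}
    evidence = sum(joint.values())  # normalizing constant
    return {label: p / evidence for label, p in joint.items()}


# Hypothetical estimates for one location.
likelihood = {"nest": 0.6, "no_nest": 0.2}
prior = {"nest": 0.25, "no_nest": 0.75}
posterior = mrf_posterior(likelihood, prior)
predicted = max(posterior, key=posterior.get)
```

In this made-up case the strong feature evidence for "nest" exactly offsets a neighborhood prior that favors "no_nest", so the two labels end up equally likely.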

A more detailed theoretical and experimental comparison of these methods can be found in [50]. Although MRF and SAR classification have different formulations, they share a common goal: estimating the posterior probability distribution. However, the two models compute the posterior probability differently and under different assumptions. For MRF, the posterior is computed using Bayes’ rule, while in SAR the posterior distribution is directly fitted to the data.

Spatial  Clustering 

Spatial clustering is the process of grouping a set of spatial objects into clusters so that objects within a cluster are highly similar to one another but dissimilar to objects in other clusters. For example, clustering is used to determine the “hot spots” in crime analysis and disease tracking. Many criminal justice agencies are exploring the benefits provided by computer technologies to identify crime hot spots in order to adopt preventive strategies such as deploying saturation patrols in hot spot areas.

Figure 13: Spatial outlier (Station ID 9) in traffic volume data

Spatial clustering can be applied to group similar spatial objects together; the implicit assumption is that patterns in space tend to be grouped rather than randomly located. However, the statistical significance of spatial clusters should be measured by testing the assumption in the data. One of the methods to compute this measure is based on quadrats (i.e., well-defined areas, often rectangular in shape). Usually, quadrats of random location and orientation are placed, the patterns falling within each quadrat are counted, and statistics are derived from the counts. Another type of statistic is based on distances between patterns; one such statistic is Ripley’s K-function [44]. After the statistical significance of the spatial clustering has been verified, classical clustering algorithms [67] can be used to discover interesting clusters.
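The quadrat-count idea can be sketched with one commonly used statistic, the variance-to-mean ratio of the counts. The sketch below uses a fixed regular grid of quadrats rather than randomly placed ones, which is a simplifying assumption, and the point coordinates are invented.

```python
from collections import Counter
from statistics import mean, variance


def quadrat_counts(points, width, height, nx, ny):
    """Count points in each cell of an nx-by-ny grid over [0,width]x[0,height]."""
    counts = Counter()
    for x, y in points:
        cx = min(int(x / width * nx), nx - 1)   # clamp points on the far edge
        cy = min(int(y / height * ny), ny - 1)
        counts[(cx, cy)] += 1
    return [counts[(i, j)] for i in range(nx) for j in range(ny)]


def variance_to_mean_ratio(counts):
    """Sample variance of the quadrat counts divided by their mean."""
    return variance(counts) / mean(counts)


clustered = [(0.5 + 0.1 * i, 0.5) for i in range(8)]
regular = [(2, 2), (3, 3), (2, 7), (3, 8), (7, 2), (8, 3), (7, 7), (8, 8)]

vmr_clustered = variance_to_mean_ratio(quadrat_counts(clustered, 10, 10, 2, 2))
vmr_regular = variance_to_mean_ratio(quadrat_counts(regular, 10, 10, 2, 2))
```

A ratio near 1 is what a random (Poisson) pattern would produce, values well above 1 indicate clustering, and values below 1 indicate a regular arrangement.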

Spatial  Outliers 

A  spatial  outlier  [30]  is  a  spatially  referenced  object  whose  non-spatial  attribute  values  differ  significantly 
from  those  of  other  spatially  referenced  objects  in  its  spatial  neighborhood.  Figure  13  gives  an  example  of 
detecting spatial outliers in traffic measurements for sensors on highway I-35W (northbound) for a 24-hour
time  period.  Station  9  seems  to  be  a  spatial  outlier  as  it  exhibits  inconsistent  traffic  flow  as  compared  with  its 
neighboring  stations.  The  reason  could  be  that  the  sensor  at  Station  9  is  malfunctioning.  Detecting  spatial 
outliers  is  useful  in  many  applications  of  geographic  information  systems  and  spatial  databases,  including 
transportation,  ecology,  public  safety,  public  health,  climatology,  and  location-based  services. 

Spatial attributes are used to characterize location, neighborhood, and distance. Non-spatial attribute dimensions are used to compare a spatially referenced object to its neighbors. The spatial statistics literature provides two kinds of bi-partite multidimensional tests, namely graphical tests and quantitative tests. Graphical tests, which are based on the visualization of spatial data, highlight spatial outliers; examples include variogram clouds [44] and Moran scatterplots [96]. Quantitative methods provide a precise test to distinguish spatial outliers from the remainder of the data. A unified approach to detect spatial outliers efficiently is discussed in [131]. Algorithms for multiple spatial outlier detection are provided in [95].
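A simple quantitative test can be sketched as follows: compare each object's attribute value with the average over its spatial neighbors and flag objects whose difference is unusually large. This is an illustrative sketch under stated assumptions (adjacent stations as neighbors, a z-score threshold of 2, invented traffic volumes) and is not claimed to be the algorithm of [131] or [95].

```python
from statistics import mean, pstdev


def spatial_outliers(values, neighbors, threshold=2.0):
    """Return indices whose neighborhood difference has |z-score| > threshold."""
    vals = [float(v) for v in values]
    # Difference between each value and the mean of its neighbors' values.
    diffs = [vals[i] - mean([vals[j] for j in neighbors[i]])
             for i in range(len(vals))]
    mu, sigma = mean(diffs), pstdev(diffs)
    if sigma == 0:
        return []  # every location agrees with its neighborhood
    return [i for i, d in enumerate(diffs) if abs(d - mu) / sigma > threshold]


# Hypothetical volumes at 12 consecutive stations along a highway.
volumes = [50, 52, 51, 53, 52, 54, 53, 55, 10, 54, 53, 52]
adjacent = {i: [j for j in (i - 1, i + 1) if 0 <= j < len(volumes)]
            for i in range(len(volumes))}
flagged = spatial_outliers(volumes, adjacent)
```

Here only index 8 (the ninth station) is flagged. Its two neighbors also show large differences, because the outlier distorts their neighborhood averages, but they stay below the threshold.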

Spatial Co-location

The co-location pattern discovery process finds frequently co-located subsets of spatial event types given a map of their locations. For example, the analysis of the habitats of animals and plants may identify the co-locations of predator-prey species, symbiotic species, or fire events with fuel, ignition sources, etc. Figure 14 gives an example of the co-location between roads and rivers in a geographic region.

Figure 14: Co-location between roads and rivers in a hilly terrain (Courtesy: Architecture Technology Corporation)

Approaches to discovering co-location rules can be categorized into two classes, namely spatial statistics approaches and data mining approaches. Spatial statistics-based approaches use measures of spatial correlation to characterize the relationship between different types of spatial features. Measures of spatial correlation include the cross K-function with Monte Carlo simulation, mean nearest-neighbor distance, and spatial regression models.

Data  mining  approaches  can  be  further  divided  into  transaction-based  approaches  and  distance-based 
approaches.  Transaction-based  approaches  focus  on  defining  transactions  over  space  so  that  an  Apriori-like 
algorithm  can  be  used.  Transactions  over  space  can  be  defined  by  a  reference-feature  centric  model.  Under 
this  model,  transactions  are  created  around  instances  of  one  user-specified  spatial  feature.  The  association 
rules  are  derived  using  the  Apriori  [21]  algorithm.  The  rules  formed  are  related  to  the  reference  feature. 
However,  it  is  non-trivial  to  generalize  the  paradigm  of  forming  rules  related  to  a  reference  feature  to  the 
case  where  no  reference  feature  is  specified.  Also,  defining  transactions  around  locations  of  instances  of  all 
features  may  yield  duplicate  counts  for  many  candidate  associations. 

In  a  distance-based  approach  [102,  129,  74],  instances  of  objects  are  grouped  together  based  on  their 
Euclidean  distance  from  each  other.  This  approach  can  be  considered  to  be  an  event-centric  model  which 
finds  subsets  of  spatial  features  likely  to  occur  in  a  neighborhood  around  instances  of  given  subsets  of  event 
types. 
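The distance-based grouping can be sketched for a pair of feature types as follows. The interest measure used here, the participation index (the smaller of the two fractions of instances that have a neighbor of the other type), comes from the co-location mining literature and is not defined in this section; the function name, the distance threshold, and the coordinates are assumptions for illustration.

```python
from math import dist


def participation_index(a_points, b_points, d):
    """Participation index of the pair {A, B} under neighbor distance d."""
    a_hit, b_hit = set(), set()
    for i, a in enumerate(a_points):
        for j, b in enumerate(b_points):
            if dist(a, b) <= d:  # Euclidean neighbor relation
                a_hit.add(i)
                b_hit.add(j)
    ratio_a = len(a_hit) / len(a_points)  # share of A instances near some B
    ratio_b = len(b_hit) / len(b_points)  # share of B instances near some A
    return min(ratio_a, ratio_b)


# Hypothetical instances of two event types.
type_a = [(0, 0), (10, 10), (5, 5)]
type_b = [(0, 1), (5, 6), (100, 100)]
pi_ab = participation_index(type_a, type_b, d=1.5)
```

No reference feature is needed: every instance of either type is examined, and a pair of types is reported as a co-location when the index exceeds a user-chosen threshold.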
7.2  Research  Needs


This section presents several research needs in the areas of spatio-temporal data mining and spatio-temporal network mining.

7.2.1  Spatio-temporal  Data  Mining 

Spatio-temporal  (ST)  data  mining  aims  to  develop  models  and  objective  functions  as  well  as  to  discover 
patterns  which  are  more  suited  to  spatio-temporal  databases  and  their  unique  properties  [115].  An  extensive 
survey  of  spatio-temporal  databases,  models  and  languages,  and  access  methods  can  be  found  in  [86].  A 
bibliography  of  spatio-temporal  data  mining  can  be  found  in  [114]. 

Spatio-temporal pattern mining focuses on discovering knowledge that is frequently located together in space and time. The problems of discovering mixed-drove and sustained emerging spatio-temporal co-occurrence patterns are defined in [39, 41, 40], which also propose interest measures and algorithms to mine such patterns. Other research needs include conflation, where a single feature is obtained from several sources or representations. The goal is to determine the optimal or best representation based on a set of rules. Problems tend to occur during maintenance operations and cases of vertical obstruction.

In several application domains, such as sensor networks, mobile networks, moving object analysis, and image analysis, the need for spatio-temporal data mining is increasing rapidly. It is vital to develop new models and techniques, to define new spatio-temporal patterns, and to formulate monotonic interest measures to mine these patterns [113].

7.2.2  Spatio-temporal  Network  Mining 

In the post-9/11 world of asymmetric warfare in urban areas, many human activities are centered on ST infrastructure networks, such as transportation, oil/gas pipelines, and utilities (e.g., water, electricity, telephone). Thus, activity reports, e.g., crime/insurgency reports, may often use network-based location references, e.g., a street address such as “200 Quiet Street, Scaryville, RQ 91101”. In addition, spatial interaction among activities at nearby locations may be constrained by network connectivity and network distances (e.g., shortest path along roads or train networks) rather than geometric distances (e.g., Euclidean or Manhattan distances) used in traditional spatial analysis. Crime prevention may focus on identifying subsets of ST networks with high activity levels, understanding underlying causes in terms of ST-network properties, and designing ST-network-control policies.
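The gap between network distance and geometric distance can be illustrated with a shortest-path computation. The sketch below uses Dijkstra's algorithm on a tiny, invented road network in which two locations are close in Euclidean terms but far apart along the roads; the node names, coordinates, and edge lengths are hypothetical.

```python
import heapq
from math import dist


def network_distance(graph, source, target):
    """Length of the shortest path from source to target (Dijkstra)."""
    best = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            return d
        if d > best.get(node, float("inf")):
            continue  # stale queue entry
        for nxt, length in graph.get(node, []):
            nd = d + length
            if nd < best.get(nxt, float("inf")):
                best[nxt] = nd
                heapq.heappush(heap, (nd, nxt))
    return float("inf")  # target not reachable


# A and B face each other, but the only connection runs via C and D.
coords = {"A": (0, 0), "B": (1, 0), "C": (0, 4), "D": (1, 4)}
roads = {
    "A": [("C", 4.0)],
    "C": [("A", 4.0), ("D", 1.0)],
    "D": [("C", 1.0), ("B", 4.0)],
    "B": [("D", 4.0)],
}
along_roads = network_distance(roads, "A", "B")
straight_line = dist(coords["A"], coords["B"])
```

In this example the straight-line distance is 1 while the road distance is 9, so an analysis based on Euclidean proximity would treat A and B as neighbors even though activity at one can reach the other only by a long detour.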

Existing spatial analysis methods face several challenges (e.g., [121]). First, these methods do not model the effect of explanatory variables to determine the locations of network hot spots. Second, existing methods for network pattern analysis are computationally expensive. Third, these methods do not consider the temporal aspects of the activity in the discovery of network patterns. For example, the routes used by criminals during the day and night may differ. The periodicity of bus/train schedules can have an impact on the routes traveled. Incorporating the time-dependency of transportation networks can improve the accuracy of the patterns.

Figure 15: Topics driving future research needs in spatial database systems


8  Summary 

In  this  chapter  we  presented  the  major  research  accomplishments  and  techniques  which  have  emerged  from 
the  area  of  spatial  databases  in  the  past  decade.  These  include  spatial  database  modeling,  spatial  query 
processing,  and  spatial  access  methods.  We  have  also  identified  areas  where  more  research  is  needed,  such 
as  spatio-temporal  databases,  spatial  data  mining,  and  spatial  networks. 

Figure 15 provides a summary of topics which continue to drive the research needs of spatial database systems. Increasingly available spatial data in the form of digitized maps, remotely sensed images, spatio-temporal data (for example, from videos), and streaming data from sensors have to be managed and processed efficiently. New querying techniques to visualize spatial data in more than one dimension are needed. A number of advances have been made in computer hardware over the last few years, but many have yet to be fully exploited, including increases in main memory, more effective storage using Storage Area Networks, greater availability of multi-core processors, and powerful graphics processors. A huge impetus for these advances has been spatial data applications such as land navigation systems and location-based services. To measure the quality of spatial database systems, new benchmarks have to be established. Some of the benchmarks [142, 110] established earlier have become dated. Newer benchmarks are needed to characterize the spatial data management needs of other systems and applications such as spatio-temporal databases, moving objects databases, and location-based services.